CAGEF_services_slide.png

Lecture 02: Data wrangling in R


0.1.0 About Introduction to R

Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into R, when required. This learning approach will allow you to spend the time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.

0.1.1 Where is this course headed?

We'll take a blank-slate approach to R here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:

data-science-explore.png

0.1.2 How do we get there? Step-by-step.

In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.

Draw_an_Owl-2.jpg

The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module, so you don't have to spend your attention taking notes.


0.2.0 Class Objectives

This is the second in a series of seven lectures. Last lecture we discussed the basic functions and structures of R as well as how to navigate them. This week we will focus more on the data.frame object and learning how to manipulate the information it holds.

At the end of this session you will be familiar with importing data from plain text and excel files; filtering, sorting, and re-arranging your data.frames using the dplyr package; the concept of piping command calls; and writing your resulting data to files. Our topics are broken into:

  1. Install and load packages for R.
  2. Import data into R (tsv, csv, xls(x)).
  3. Inspect data with the base R functions.
  4. Use the dplyr package to filter, subset, and manipulate your data and to perform simple calculations.
  5. Export your data after manipulating it.

Data-Wrangling-Is-The.jpg


0.3.0 A legend for text format in Jupyter markdown

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green box: Recommended reads and resources to learn R

0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs and Recordings

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.


0.4.4 Dataset 1: /data/miscellaneous.xlsx

An Excel workbook that we will use to show how we can import even entire Excel workbooks into R.

0.4.5 Dataset 2: /data/microbes.csv

This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.

We will use the microbes.csv dataset to learn how to manipulate our data using dplyr.


1.0.0 Installing and importing packages

I_made_this_package.png

Packages are groups of related functions that serve a purpose. They can be a series of functions to help analyse specific data or they could be a group of functions used to simplify the process of formatting your data (more on that later in this lecture!).

Depending on their structure they may also rely on other packages.

1.1.0 Locating packages

There are a few different places you can install R packages from. Listed in order of decreasing trustworthiness:

Regardless of where you download a package from, it's a good idea to document that installation, especially if you had to troubleshoot it (you'll eventually be there, I promise!).

devtools is a package used by developers to make R packages, but it also helps us install packages from GitHub. It is downloaded from CRAN.


1.2.0 Installing packages for your Jupyter Notebook (on JupyterHub)

Installing packages through your JupyterHub notebook is relatively straightforward but any packages you install only remain during your current instance (login) of the hub. Whenever you logout from the JupyterHub, these installed libraries will essentially vaporize.

The install.packages() command will work just as it should in R and RStudio. Find instructions in the Appendix section of Lecture 01 for installation of packages into your own personal Anaconda-based installation of Jupyter Notebook.


1.2.1 Will it or won't it install? Check for dependencies!

R may give you package installation warnings. Don't panic. In general, your package will either be installed and R will test if the installed package can be loaded, or R will give you a non-zero exit status - which means your package was not installed. If you read the entire error message, it will give you a hint as to why the package did not install.

Some packages depend on previously developed packages and can only be installed after another package is installed in your library. Similarly, that previous package may depend on another package and so on. To solve this potential issue we use the dependencies logical parameter in our call.
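As a sketch, a call that also pulls in everything a package depends on might look like this (the package name is just an example):

```r
# dependencies = TRUE tells R to also install the packages
# that this package depends on, suggests, or imports.
install.packages("tidyverse", dependencies = TRUE)
```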


1.2.2 Use library() to load your packages after installation

A package only has to be installed once. It is now in your library. To use a package, you must load the package into memory. Unless this is one of the packages R loads automatically, you choose which packages to load every session.

library() takes a single argument. library() will throw an error if you try to load a package that is not installed. You may see require() on help pages, which also loads packages; it is usually used inside functions (it gives a warning instead of an error if a package is not installed).
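For example, using the tidyverse package mentioned earlier in this lecture:

```r
library(tidyverse)  # throws an error if tidyverse is not installed

# require() returns TRUE/FALSE and only warns on failure,
# which is why you may see it inside functions:
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
```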

Errors versus warnings: So far we've seen that errors stop code from running. Warnings allow code to keep running. A warning may not lead to an error, but it certainly leaves your code vulnerable to errors down the road.

1.3.0 Installing packages from Bioconductor requires BiocManager

To install from Bioconductor, use the BiocManager package, which helps pull down and install packages from the Bioconductor repository.
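A quick sketch of the two-step process (DESeq2 is used here only as an example Bioconductor package):

```r
# BiocManager itself is installed from CRAN...
install.packages("BiocManager")

# ...and then installs packages from the Bioconductor repository.
BiocManager::install("DESeq2")
```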


1.4.0 Skip loading a library with package::function()

As mentioned above, devtools is required to install from GitHub. We don't actually need to load the entire library for devtools if we are only going to use one function. We select a function using this syntax package::function().

All packages are loaded the same regardless of their origin, using library().
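For instance, we can call a single devtools function without loading the whole library (the repository name below is just a placeholder):

```r
# package::function() calls one function from an installed
# but unloaded package - no library(devtools) needed.
devtools::install_github("username/repositoryname")
```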


1.5.0 Packages used in this lesson

The following packages are used in this lesson:

1.5.1 Quick info for how to install R packages in Anaconda

Jupyter Notebooks generally do a good job of installing packages but if you want a little more control over the process, you can do so via Anaconda. Open up the Anaconda prompt and install your packages of interest.

conda install - the command that calls the installer for Anaconda.
-c conda-forge - look for the package in the 'conda-forge' channel.
r-packagename - the name of the package you're interested in installing.

Combine the parts into a single command like:

conda install -c conda-forge r-essentials # This will install tidyverse along with the other dependencies

conda install -c conda-forge r-googlesheets4


2.0.0 Reading files in R

taken_readr.jpg


The most important thing when starting to work with your data is to know how to load it into the memory of the R kernel. There are a number of ways to read in files, and each is suited to dealing with specific file types or file sizes, or may perform better depending on how you wish to read/store the file (all at once, a line at a time, or somewhere in between!).

There are many file formats you may come across in your journey but the most common will be CSV (comma-separated values), TSV (tab-separated values), FASTQ (usually used for storing biological sequences), or some archived (ZIP, GZ, TGZ) version of these. R is even able to open these archived versions in their native format! We may interchangeably use the word parsing to describe the action of reading/importing formatted data files.

2.1.0 Import data to a tibble with read_csv()

The tidyverse package has its own function for reading in text files because the tibble structure was first developed as part of the dplyr package! If we'll be spending our time working with the tidyverse, then we may as well use its commands for importing files! If you want to learn how to do this with the base R utils package, check out the Appendix section for details.

Let's look quickly at the read_csv() function which is a specific version of the read_delim() function from the readr package. The arguments we are interested in are:

From this point on, we'll pretty much use the terms tibble and data.frame interchangeably.
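A minimal sketch of the import, using this lecture's dataset:

```r
library(readr)  # loaded automatically with library(tidyverse)

# Read a comma-separated file into a tibble:
microbes <- read_csv("data/microbes.csv")
```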


As you can see, it's a pretty smooth process to parse simple text files. We'll learn some additional functions as we become familiar with the tidyverse package as well.


2.2.0 Read excel spreadsheets with readxl package

What happens if we have an Excel file? The readxl package, which is installed as part of the tidyverse package, will recognize both xls and xlsx files. It expects tabular data.

Note that while we have already loaded the tidyverse package, we will need to explicitly load readxl so we can use the read_excel() function to accomplish our task. Some parameters we are interested in are:

First, let's try opening our Excel file with read_csv().

Looks like it didn't work...


2.2.1 Retrieve excel sheet names with excel_sheets()

Why doesn't our output look like a workbook? The read_excel() function defaults to reading in the first worksheet. You can specify which sheet you want to read in by position or name. Let's see what the names of our sheets are using the excel_sheets() function.

The excel_sheets() function returns a character vector as output.
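In code, that looks like:

```r
library(readxl)

# Returns a character vector of worksheet names in the workbook:
excel_sheets("data/miscellaneous.xlsx")
```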


2.2.2 Subset sheet and range within read_excel()

If we want to get fancy, it is possible to subset from a sheet by specifying cell numbers or ranges. Here we are grabbing sheet 1 (microbes), and subsetting cells over a range defined by two cells - A2:D9.

For our purposes, the read_excel() function takes the form of read_excel(path, sheet = NULL, range = NULL) but there are additional parameters we can supply to the function. See ?read_excel for more information.
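Sketched out, the two calls might look like this (the sheet name "microbes" comes from the workbook described above):

```r
library(readxl)

# Grab sheet 1 and subset a rectangular range of cells:
read_excel("data/miscellaneous.xlsx", sheet = 1, range = "A2:D9")

# Specify the sheet by name and grab whole rows instead;
# col_names = FALSE stops data from filling in as the header:
read_excel("data/miscellaneous.xlsx", sheet = "microbes",
           range = cell_rows(2:9), col_names = FALSE)
```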


Caution: Note that we no longer have proper column headings! If you had a complex header, or were subsetting by a range like this, you would include some additional steps to first retrieve the metadata (column names) and update the resulting data.frame.

We could alternatively specify the sheet by name. This is how you would simply grab rows.

Note that if your first row is the header, excluding this row will result in data filling in the header unless you include the parameter col_names = FALSE.

Likewise, how would you subset just columns from the same sheet?

Using the range parameter: to learn more about the range parameter and using it with a series of helper functions, you can visit the readxl section on the tidyverse page.

2.3.0 lapply() is the list version of apply()

How would we read in all of the sheets at once? One solution is to use lapply(), a version of the apply() function. lapply() returns a list object of the same length as X, in which each element is the result of applying FUN to the corresponding element of X. Note that the elements of the returned list could be any kind of object!

We can use lapply() so that each sheet will be stored as a data.frame inside of a list object. Recall that apply() took in a matrix, a row/column specification (MARGIN), and a function.

lapply(), instead, drops the MARGIN parameter and takes in a vector or a list, a function, and any additional arguments for the function. Remember that lists are a single dimension and thus do not have a row/column configuration.

So far we have been accustomed to functions finding our variables globally (in the global environment); lapply() looks locally (within the function), so we need to explicitly provide our path. We will get more into local vs. global variables in our control flow lesson (lecture 07). For now, just know we can read in all worksheets from an Excel workbook.


2.3.1 The finer details of lapply()

Remember the parameters of

`read_excel(path, sheet = NULL, range = NULL)`

Notice that the second position parameter is sheet. In our lapply() function assignment we didn't specifically name that parameter! Recall we used

`lapply(X= excel_sheets("data/miscellaneous.xlsx"), FUN = read_excel, path = "data/miscellaneous.xlsx")` 


and thus explicitly named our first parameter path. The next available parameter by default order was sheet to which the elements of X were applied. We now have a list object with each worksheet being one item in the list.

You can subset the data.frame you would like to work with using the syntax list[[x]] and store it as a variable using data.frame().
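Putting the pieces from this section together (assuming readxl is loaded):

```r
library(readxl)

# Read every worksheet into a list of tibbles:
sheets <- lapply(X = excel_sheets("data/miscellaneous.xlsx"),
                 FUN = read_excel,
                 path = "data/miscellaneous.xlsx")

# Pull out the first sheet and store it as a plain data.frame:
first_sheet <- data.frame(sheets[[1]])
```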

Working with lists of data.frames can be cumbersome, but applying multiple procedures to these objects can be made easier with the purrr package, which extends the abilities of R to associate and run functions on elements from a list.

2.3.2 A tibble is essentially a data.frame

Notice that the object type of our imported sheet isn't exactly a data.frame. Rather, it is a tibble, which is an extended version of the data.frame. Overall, a tibble replicates the same behaviours as a data.frame except when printing/displaying (it outputs only the first 10 rows vs. all) and in how we subset a single column. As long as you use methods from within the tidyverse, this construct will work just fine.

Subsetting a tibble using the index notation [, 1] returns a tibble object containing the first column of your data. In a data.frame, this same notation would return a vector object. This can sometimes cause type-errors when working with older functions or packages outside the tidyverse.

If you'd like to exclusively work with a data.frame, you can cast it using the data.frame() command.
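A minimal sketch of the difference, using a toy tibble:

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))
tb[, 1]               # returns a one-column tibble

df <- data.frame(tb)  # cast to a plain data.frame
df[, 1]               # returns a vector: 1 2 3
```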

At this point, you will be able to use your excel worksheet as a normal data.frame in R. Notice above that our abundance column is identified as a character variable? How could we convert that to a numeric or double?

If you are a Google Sheets person, there is a package (surprisingly called 'googlesheets4') that will allow you to get your worksheets in and out of R. For more information, check out the tidyverse/googlesheets4 page.


3.0.0 Inspecting your data

inspecting_data.png Image courtesy of xkcd

We'll often make assumptions about our datasets, like all of the values for a variable being within a certain range, or all positive. We also usually assume that all of the entries in our data are complete: no missing values or incorrect categories. This can be a bit of a trap, especially in large datasets where we cannot view it all by eye. Here we'll discuss some helpful tools for inspecting your data before you start running more complex code on it.


3.1.0 Helpful commands for inspecting your data

When first importing data (especially from outside sources) it is best to inspect it for problems like missing values, inconsistent formatting, special characters, etc. Here, we'll inspect our dataset, store it in a variable, and check out the structure by reviewing some helpful commands:

  1. class() to quickly determine the object type. You see this information in the str() command too.
  2. head() to quickly view just the first n rows of your data.
  3. tail() to quickly view just the last n rows of your data.
  4. unique() to quickly view the unique values in a vector or similar data structure.
  5. glimpse() and View() (in RStudio) to take a peek at your data structures.

3.1.1 Use head() to view the first portion of your data

You can take a look at the first few rows (6 by default) of your data.frame using the head() function. In fact you can play with the parameters to pull a specific number of rows or lines from the start of your data.frame or list.


3.1.2 Use tail() to view the latter portion of your data

Likewise, to inspect the last rows, you can use the tail() function. Again, you can decide on how many rows you'd like to see from the end of your object.
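Both of these can be sketched as follows (assuming the microbes data.frame imported earlier):

```r
head(microbes)          # first 6 rows by default
head(microbes, n = 10)  # or ask for a specific number of rows
tail(microbes, n = 3)   # last 3 rows
```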


3.1.3 Use unique() to retrieve a list of the unique elements within an object

You may be interested in knowing more about the data set you're working with such as "How many different genera turned up in our entire experiment?" Recall that we have a column labeled genus within our data set microbes.

You could extract the whole column and scan through it or look at just a portion of it.


As you may have noticed, this method printed the entire genus column. While useful information for certain aspects, it doesn't answer our main question of how many different genera turned up in our sampling.

The function unique() can help us answer this question by removing duplicated entries, thus living up to its name. It can take in a number of different objects but usually returns an object of the same type that it was given as input.

Let's take a look at using it on our question.


3.1.4 Use length() or str() to retrieve the size of some objects

Note from above that we have only one entry per genus, but how many genera are there in total? Here we introduce length() which does just as it implies by returning the length of a vector, list, or factor. You can also use it to set the length of those objects but it's not something we have reason to do.

On the other hand str() always gives us the same kind of information plus a little more. Later on, we'll see that more isn't always better and that using length() has its advantages.
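Combining the two functions from this section (assuming the microbes data.frame and its genus column from earlier):

```r
# unique() drops the duplicate entries; length() counts what's left:
length(unique(microbes$genus))
```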


Using unique() we are returned a character vector containing 251 genera.

3.1.5 glimpse() and View() show us our data

Suppose we want to see more of our data frame. There are a couple of choices that can be used outside of Jupyter Notebook. In RStudio you have access to your Environment pane which can give you a quick idea of values for variables in your environment, including a bit of what your data.frame looks like.

Clicking on a data object like microbes will generate a new tab that shows your entire data.frame in a human-readable format similar to an Excel spreadsheet. The same result can be accomplished by using the view command View(microbes).

The glimpse() command brings up a comprehensive summary of your object that looks very similar to the information provided in the Environment pane. You'll find it looks very much like the str() command but is formatted in a more human-readable way. It tries to provide as much information as possible in a small amount of space.

We can use this command in Jupyter so let's take a glimpse at glimpse().

The information provided by glimpse() is sparser, and the formatting is a little tighter: we don't have to see the extra column information as with str(), which can save a lot of vertical space. On the other hand, the command takes longer to type, but that's a personal choice.


3.2.0 Special data: NA and NaN values

What happens when you import data with missing values? These could be empty entries in a CSV file or blank cells in a xlsx file. Perhaps, as we'll see later, it could be a specifically annotated entry like "No_Data". These are usually the result of missing data points from an experiment, but could also arise for other reasons, like values below a detection threshold, depending on the source of your data.

Missing values in R are handled as NA (Not Available). Impossible values (like the result of dividing zero by zero) are represented by NaN (Not a Number). These types of values can be considered null values. These two types of values, especially NAs, have to be dealt with in special ways; otherwise they may lead to errors in functions that we frequently use.

Let us begin by building an example containing NA values.
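A minimal sketch of such a vector (the name na_vector matches the one used later in this section):

```r
# A small numeric vector with one missing value:
na_vector <- c(1, 2, NA, 4, 5)
```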


3.2.1 Some functions can be told to ignore or remove NA values

Some mathematical functions can ignore NA values by setting the logical parameter na.rm = TRUE. Under the hood, if the function recognizes this parameter, it will remove the NA values before performing its mathematical operation.
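For example, with a small vector containing one NA:

```r
na_vector <- c(1, 2, NA, 4, 5)

mean(na_vector)                # NA: the missing value poisons the result
mean(na_vector, na.rm = TRUE)  # 3: NAs removed before the calculation
```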


3.2.2 What happens when we try to use functions via apply() on data with NAs?

Now, I am going to take the counts data from lecture 01 and add a few NAs. If I now try to calculate the mean number of counts, I will get NA as an answer for the rows that had NAs.

Recall: we can pass additional parameters to apply() that are meant as parameters for our function FUN. So all we have to do is update the code appropriately to include the na.rm = TRUE parameter.

3.2.3 Use the is.na() function to check your data

How do we find out ahead of time that we are missing data? Knowing is half the battle and is.na() can help us determine this with some data structures. The is.na() function can search through data structures and return a boolean structure of the same dimensions.

With a vector we can easily see how some basic functions work.
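A sketch using the same small vector as above:

```r
na_vector <- c(1, 2, NA, 4, 5)

is.na(na_vector)       # FALSE FALSE TRUE FALSE FALSE
sum(is.na(na_vector))  # 1: TRUE counts as 1 when summed
```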


3.2.4 The any() function evaluates logical vectors

In the case of large data frames, as you can see, there are just too many entries to identify by eye. Sometimes we are just interested in knowing if at least one of our logical values is TRUE. That is accomplished using the any() function, which can evaluate one or more vectors (or data.frames) and answers whether the input contains at least one TRUE value.

We can use it to quickly ask if our microbes data frame has any NA values.
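In one line (assuming the microbes data frame imported earlier):

```r
# TRUE if at least one NA exists anywhere in the data frame:
any(is.na(microbes))
```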


Now we've confirmed that there are some NA values in our data. Given that there are 6656 rows, we need to find a way to identify those rows with NA values and conversely those without NA values. Let's start with simple structures.

3.2.5 Find what you're looking for with the which() function

Using is.na() we were returned a logical vector of whether or not a value was NA. There are some ways we can apply this information through different functions but a useful method applicable to a vector of logicals is to ask which() positional indices return TRUE.

In our case, we use which() after checking for NA values in our object.
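Sketched with the small vector from above:

```r
na_vector <- c(1, 2, NA, 4, 5)

which(is.na(na_vector))   # 3: the position of the NA
which(!is.na(na_vector))  # 1 2 4 5: positions of the complete values
```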


3.2.6 Use complete.cases() to query larger objects

We have verified in many ways that we have at least one NA value in counts. Often we may wish to drop incomplete observations where one or more variables is lacking data. Using the which() function would be helpful but, as we can see from our above example, it only returns the element order for the whole data.frame. Instead, we want to look for rows that have any NA values. If you were only concerned with NA values in a specific column of your dataframe, which() would be a good way to accomplish your task.

In the case of removing any incomplete rows, the function complete.cases() looks row by row to see whether each row contains an NA and returns a logical vector with one entry per row (TRUE for complete rows). You can then subset out the rows with the NAs using conditional indexing.
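A sketch of both steps, assuming the microbes data frame from earlier:

```r
# One logical value per row; TRUE means the row has no NAs:
complete.cases(microbes)

# Conditional indexing keeps only the complete rows:
microbes_complete <- microbes[complete.cases(microbes), ]
```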

Conditional indexing: That's right! We used conditional indexing above in section 2.1.5.5 to remove NA values from our na_vector. A data structure of booleans (TRUE and FALSE) can be used to select elements from within another data structure, as long as the relevant dimensions match!

3.2.7 Consider just replacing the NAs with something useful

Depending on your data or situation, you may want to include rows (observations) even though some aspects may be incomplete. Instead, consider replacing NAs in your data set. This could be replacement with a sample average, or the mode of the data, or a value that is below a threshold.
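As one possible sketch, replacing NAs with the sample average:

```r
na_vector <- c(1, 2, NA, 4, 5)

# Replace each NA with the mean of the observed values:
na_vector[is.na(na_vector)] <- mean(na_vector, na.rm = TRUE)
na_vector  # 1 2 3 4 5
```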

More about NA values: To learn about a few more functions that you can use to identify and remove NA values from your data structure, check out the Appendix at the end of this lecture.

4.0.0 A quick introduction to the dplyr (DEE ply er) package

Now that we've inspected our data for various pitfalls, we can move on to filtering and sorting. To be able to answer any questions with our data, we need the ability to select and filter parts of our data. This can be accomplished with base functions in R, but the dplyr package provides a more human-readable syntax.

making_progress.png Image courtesy of xkcd

The dplyr package was made by Hadley Wickham to help make data frame manipulation easier. It has 5 major functions:

  1. filter() - subsets your data.frame by row
  2. select() - subsets your data.frame by columns
  3. arrange() - orders your data.frame alphabetically or numerically by ascending or descending variables
  4. mutate(), transmute() - create a new column of data
  5. summarize() or summarise() - reduces data to summary values (for example using mean(), sd(), min(), quantile(), etc)

4.1.0 Use conditionals to specify subsets of your data based on criteria

It is often extremely useful to subset your data by some logical condition. We've seen some examples above where we used functions and code to identify and keep specific rows. Let's dig deeper into that topic.

Conditionals ask a question about one or more values and return a logical (TRUE or FALSE) result. Here's a quick table breaking down the uses of basic conditional statements.

Logical operator Meaning Example Result
== equal to "this" == "that" FALSE
!= not equal to 4 != 5 TRUE
> greater than 4 > 5 FALSE
>= greater than or equal to 4 >= 5 FALSE
< less than 4 < 5 TRUE
<= less than or equal to 4 <= 5 TRUE

Mastering the meaning and use of these logical operators will go a long way to helping you in your data science journey!


4.1.1 Use the %in% syntax to compare sets

Sometimes the simplest kind of conditional can be thought of as comparing two sets of data. Which values in A exist in B? For example, we may want to keep all rows that have either Smithella OR Methanobacteria.

To accomplish this using basic functions in R, we turn to the match binary operator, %in%, which asks, for each element of x, whether it is present in y, using the syntax x %in% y. This operator returns a logical vector the same length as x, with TRUE wherever the element of x is found in y.

Let's see what that looks like in the context of our above question.
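A minimal sketch, assuming the microbes data frame and its genus column from earlier:

```r
# TRUE for rows whose genus matches either name:
matches <- microbes$genus %in% c("Smithella", "Methanobacteria")

# Conditional indexing keeps just those rows:
microbes[matches, ]
```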


4.2.0 Use the filter() function to replicate %in% and more!

From our query above we already know we were asking R to search through our data frame under the genus column for any matches to Smithella OR Methanobacteria. The notation, however, can be a little confusing, whereas the filter() function can accomplish the same task with a more human-readable syntax.

Using the filter() function we can evaluate each row with our criteria. Our first argument will be our data.frame, followed by the information for the rows we want to subset by. Notably, filter() drops any NA rows/values that might result from our comparisons. Why is that important?


4.2.1 Slicing and filtering your data requires the proper use of logical operators

Our code produced an empty tibble because we used the logical operator & (AND). For us it makes sense to want only Smithella AND Methanobacteria, but to R it won't make sense because a genus can't be both Smithella AND Methanobacteria at the same time. That's why we need to use the | (OR) operator to select everything that is Smithella OR Methanobacteria. Here's a handy summary about the logical operators.

Operator Description Use or Result
! Logical NOT Converts logical values into their opposites
& Element-wise logical AND Performs element-wise AND; the result has the length of the longer operand
&& Logical AND Examines only the first element of each operand, returning a single logical value
| Element-wise logical OR Performs element-wise OR; the result has the length of the longer operand
|| Logical OR Examines only the first element of each operand, returning a single logical value


Logical operators: To summarize, "&" will return TRUE if all elements in that single comparison are TRUE, while "|" will return TRUE if any element in that single comparison is TRUE. This logic is applied between index-matched elements and can be combined into more complex statements!


Now, let's try that filter() command again.
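A sketch of the corrected call, this time with the OR operator (and an equivalent %in% version):

```r
library(dplyr)

# A genus can't be both names at once, so use OR (|):
filter(microbes, genus == "Smithella" | genus == "Methanobacteria")

# Equivalently, using the %in% operator from the previous section:
filter(microbes, genus %in% c("Smithella", "Methanobacteria"))
```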